Many attempts have been made to classify breast cancer data, since 
this classification is critical in a wide variety of applications related to the 
detection of anomalies, failures, and risks. This paper reviews and compares 
machine learning (ML) models for the classification of breast cancer data. The 
effectiveness of the models is evaluated comparatively, using accuracy as the 
benchmark, which was not done earlier. The models considered for the study 
are k-nearest neighbor (KNN), decision tree classifier, support vector machine 
(SVM), random forest (RF), SVM kernels, logistic regression, and Naive Bayes. 
These classifiers were tested, analyzed and compared with each other. The 
decision tree classifier achieves the highest accuracy among these models, 
97.08%, and is therefore termed the best ML algorithm for the breast cancer 
data set. 
1. INTRODUCTION 

Artificial intelligence (AI) combines engineering, computer science and other applied sciences. It has 
the capability to solve complex problems that would traditionally require human intelligence [1]. Within 
this field, machine learning (ML) has progressed rapidly and is expected to solve many real-world problems. 
Accordingly, researchers are investigating the first applications of AI to tumor imaging and classification, 
which could produce a major technological change in the field of oncology [2]-[6]. ML is the part of AI that 
deals with the scientific study of the statistical models and algorithms that computers use to perform a 
specific task. The current study focuses on breast cancer data classification by means of popular ML models. 

Section 2 reviews related studies on disease classification and gives an overview of the ML models 
used. Section 3 explains the proposed model. Section 4 describes the method used. Section 5 presents the 
result analysis and discussion. Section 6 is devoted to the conclusion. 


2. PREVIOUS BACKGROUND 

Rahman and Muniyandi [7] proposed a method to classify cancer with greater accuracy and 
efficiency using a two-step feature selection (FS) technique combined with a 15-neuron neural network. 
They used the Wisconsin diagnostic breast cancer dataset; the FS technique reduced the number of 
attributes, while the 15-neuron neural network classified the cancer. By reducing the data dimensionality, 
the FS technique increased the classification accuracy to 99.4%, which is significantly higher than in the 
existing papers. Tanha et al. [8] tried to identify the relationships among the various factors of different 
breast cancer groups. They applied data mining techniques and classification algorithms to a dataset of 
624 patients from Iran. Using binary tree and rule-based classification, a model was developed from 
patterns discovered during training and testing, while establishing the significant relationships among the 
different prognostic indices. 

A new method for classifying cancer has been proposed in [9]. The Naive Bayes classification 
technique is considered to be very accurate under its assumption that the predictors are conditionally 
independent, but this assumption can also lead to a loss of accuracy. The authors proposed a method that 
uses the Hoeffding tree method for normal classification and then Naive Bayes classification for reducing 
data dimensionality. This separation technique proved to be quite effective and was able to identify input as 
benign, non-benign and normal with significant accuracy. The study in [10] discusses how anthropometric 
data and parameters collected in routine blood analysis can be used to estimate the likelihood of having 
breast cancer. By applying an artificial neural network (ANN) and Naive Bayes classification to routinely 
acquired and controlled parameters, notably high diagnostic accuracy is achieved. Therefore, using either 
the ANN or the Naive Bayes technique, it is possible to detect breast cancer early. 

The study in [11] collates various ML techniques and examines their levels of accuracy in FS. The 
techniques were applied to the Wisconsin diagnostic breast cancer dataset and appraised on runtime 
results, classification tests and standardized evaluation. The six ML techniques discussed in the paper 
include classification and regression techniques. The authors varied the dataset and applied the techniques 
to the resulting sub-datasets. 80% of the dataset was used for training and 20% for testing, while accuracy 
was obtained from a voting classifier. All ML techniques showed more than 90% accuracy at various levels 
of classification, especially on subsets of data. Darabi et al. [12] discussed how deep learning techniques 
can lead to various methods for the automatic detection of breast cancer from histopathological images. 
The authors proposed a well-founded model based on deep transfer learning, in which a deep convolutional 
neural network (DCNN) is pre-trained on a well-constructed and notable assortment of the ImageNet 
dataset and data augmentation is used in order to distinguish malignant from benign tissue samples. The 
model covers both binary and multiclass classification and was examined using optimized 
hyper-parameters. They developed an effective transfer learning architecture with pre-trained DenseNet121 
and ResNet50 models combined with a fully-connected classifier. In [13], hybrid minimum redundancy 
maximum relevance (mRMR) and random subset feature selection (RSFS) algorithms were used for feature 
selection, with k-nearest neighbor (KNN) and support vector machine (SVM) algorithms as classifiers; 
accuracies of 77.41% and 73.07% and sensitivities of 98% and 72.72%, respectively, were achieved. 

Yan et al. [14] proposed a technique to identify breast cancer cells in histopathological images 
using a new hybrid convolutional and recurrent deep neural network. The proposed method uses a recurrent 
neural network (RNN) extractor, which follows the convolutional neural network (CNN) extractor, to take 
into account the short- and long-term spatial correlations between pathological image patches. The authors 
also released their dataset to the public; it is one of the largest publicly available datasets and consists of 
3771 breast cancer histopathological images, which are inclusive, diverse and cover various subclasses. 
According to [15], deep learning allows multitask learning, which reduces the losses by applying different 
algorithms. Deep learning also allows multimodal learning, that is, the process of integrating different types 
of data from different data sources. Using deep learning techniques, one can quantitatively examine the 
number of patients and the types of techniques used, and qualitative assessments are also possible. 

In the field of oncology, the following challenges were observed during the study. 

— Small number of datasets: Few datasets are available, and with so few datasets the performance 
and accuracy of a model cannot be determined. This lack of datasets is the biggest barrier 
preventing oncologists from using AI in their treatment. 

— Applying nonlinear techniques: Nonlinear ML techniques can be applied to nonlinear phenomena in 
oncology, but they may degrade the utility and performance of the dataset. 

— Unbiased training: Due to limited datasets, it is difficult to conduct unbiased training of ML algorithms. 

— Lack of quality datasets: Due to the lack of large, high-quality datasets, it is difficult to determine the 
accuracy. 

— Lack of security: The worst challenge faced by AI in oncology is privacy and security. 

— Lack of accuracy: The black-box nature of AI models makes it difficult to interpret their results 
accurately, which might cause unintended harm. 

— Larger expectations: There has been a misconception about AI capabilities. AI is expected to perform 
tasks beyond its capabilities. 

AI is used to extract phenotypic characteristics from medical imaging data, an approach called 
radiomics, which is one of the most popular areas of research in AI. Radiomics started with engineered 
features and ML algorithms. ML builds a mathematical model from the available sample data and makes 
predictions to solve complex tasks. The following is a brief overview of the ML models used in the 
classification process during the study. 

Logistic regression is a very popular supervised ML algorithm. It is used for predicting a categorical 
dependent variable. The output is either 0 or 1, or yes or no [16]. It gives probabilistic values that lie 
between 0 and 1. Logistic regression works much like linear regression but differs in several respects: 
linear regression is used for solving regression problems, whereas logistic regression is used for 
classification problems. In linear regression the model is fitted with a straight line, and in logistic 
regression the model is fitted with an S-shaped curve [17], [18]. 

KNN is one of the simplest supervised ML algorithms and is easy to implement. KNN finds the k 
nearest neighbors of a data point in the dataset and uses them to assign a new data point to the nearest 
class. It measures the similarity between the new case and the available cases and puts the new case into 
the category to which it is most similar. The algorithm is used in many applications such as texture 
selection [19], clustering [20], classification [21], and pattern recognition [22]-[24]. 
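
The neighbor search and voting described above can be sketched in a few lines of Python. This is an illustrative implementation that assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy two-feature points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy two-feature samples (illustrative values): 0 = benign, 1 = malignant.
points = [(13.54, 14.36), (13.08, 15.71), (9.504, 12.44),
          (20.57, 17.77), (19.69, 21.25), (19.17, 24.8)]
labels = [0, 0, 0, 1, 1, 1]
predicted = knn_predict(points, labels, query=(12.0, 14.0), k=3)
```

Because distances are sensitive to feature scale, KNN is usually applied after the features have been standardized.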

A decision tree is a supervised ML algorithm that supports both classification and regression, as in 
classification and regression trees (CART). It has a tree-like structure. The tree starts with a root node; each 
internal node represents a test on a feature, each leaf node represents a class label, and the branches 
represent conjunctions of features that lead to a class [25]. High speed and efficiency are two advantages 
of decision trees. Decisions are taken on the basis of the characteristics, or features, of the dataset [26]. 
The tree is a pictorial representation of the given conditions and gives the best solution [25]. The decision 
tree is one of the most widespread algorithms for estimation, prediction and pattern classification [27]. 
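
To illustrate how a single node of such a tree can choose its test, the sketch below scores candidate thresholds on one feature by weighted Gini impurity, the criterion commonly used by CART. It assumes binary 0/1 labels; the helper names and the sample values are hypothetical and do not reproduce the implementation used in the study.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split `value <= threshold`."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # a split must send samples to both children
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best

# Illustrative feature values with 0 = benign, 1 = malignant.
split = best_threshold([9.5, 13.1, 13.5, 19.2, 19.7, 20.6], [0, 0, 0, 1, 1, 1])
```

A full tree repeats this search over all features and recurses on the two resulting subsets until the leaves are pure or a stopping rule is met.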

SVM is one of the most popular supervised ML algorithms. It is typically used for two-group 
classification problems and is also used in image classification. SVM can be used for both classification 
and regression [28]-[30]. The goal of the SVM algorithm is to create the best decision boundary, which is 
called a hyperplane. To construct it, SVM chooses the extreme points, which are called support vectors. 
The complexity depends on the support vectors, and the accuracy of SVM is increased by reducing the 
dimensionality of the dataset [30]-[33]. 

The SVM algorithm uses a group of mathematical functions called kernels. A kernel takes data as 
input and transforms it into the desired form. The kernel function returns the scalar product between two 
data points in the transformed space. SVM kernels can be used for univariate and multivariate time series 
analysis [34]. A multi-kernel SVM algorithm has been proposed to classify brain tumors accurately [35]. 
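
The role of the kernel can be made concrete with a short sketch. The linear and radial basis function (RBF) kernels below are two standard choices, and the decision function shows how a trained SVM combines kernel values with its support vectors. The value of `gamma`, the support vectors and the coefficients are hypothetical placeholders, since the study does not report them.

```python
import math

def linear_kernel(x, y):
    """Plain scalar product of two data points."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: similarity decays with squared Euclidean distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def decision_function(x, support_vectors, dual_coefs, bias, kernel):
    """f(x) = sum_i coef_i * K(s_i, x) + bias; the sign of f(x) gives the class."""
    return sum(c * kernel(s, x) for s, c in zip(support_vectors, dual_coefs)) + bias

# Hypothetical trained model with one support vector per class.
support_vectors = [[0.0, 0.0], [4.0, 4.0]]
dual_coefs = [-1.0, 1.0]
score = decision_function([3.5, 3.5], support_vectors, dual_coefs, 0.0, rbf_kernel)
```

A positive score places the point on the side of the second support vector and a negative score on the side of the first.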

Naive Bayes is a supervised ML algorithm based on Bayes' theorem. It is used for classification 
problems: the class of an unseen sample is predicted using Bayes' theorem. It is mainly used for 
classification with high-dimensional training sets. It is an effective algorithm that helps in constructing 
fast ML models and makes quick predictions. It is a probabilistic classifier, so its output is a probability 
[36]. Because its predictions are probability based, the output carries no certainty and is therefore not 
fully reliable. Sentiment analysis, article classification and spam filtering are a few popular applications 
of the Naive Bayes algorithm [37], [38]. 
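
A compact sketch of this idea for continuous features is given below. It assumes the Gaussian variant of Naive Bayes, in which each feature is modeled by a per-class normal distribution and the features are treated as conditionally independent; the function names, the variance smoothing constant and the toy data are illustrative assumptions.

```python
import math
from statistics import mean, pvariance

def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-feature mean/variance for every class."""
    model = {}
    for cls in set(labels):
        cls_rows = [row for row, y in zip(rows, labels) if y == cls]
        prior = len(cls_rows) / len(rows)
        # A small constant avoids division by zero for constant features.
        stats = [(mean(col), pvariance(col) + 1e-9) for col in zip(*cls_rows)]
        model[cls] = (prior, stats)
    return model

def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, row):
    """Pick the class with the highest log posterior (Bayes' theorem in log form)."""
    scores = {}
    for cls, (prior, stats) in model.items():
        log_likelihood = sum(log_gaussian(x, mu, var) for x, (mu, var) in zip(row, stats))
        scores[cls] = math.log(prior) + log_likelihood
    return max(scores, key=scores.get)

# Toy two-feature samples (illustrative values): 0 = benign, 1 = malignant.
rows = [(13.54, 14.36), (13.08, 15.71), (9.504, 12.44),
        (20.57, 17.77), (19.69, 21.25), (19.17, 24.8)]
labels = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(rows, labels)
```

Working with log probabilities avoids numerical underflow when many features are multiplied together.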

The random forest (RF) algorithm belongs to a family of classification algorithms built from a 
collection of decision trees. The main component of RF is a binary tree constructed using recursive 
partitioning (RPART). RF supports both classification and regression. Each binary tree splits the data at a 
node into two offspring nodes, from the root node downwards, so that similarity and consistency are 
maintained. An RF is often a group of hundreds to thousands of trees, where each tree is grown using a 
bootstrap sample of the original data [39]. RF trees differ from classification and regression trees in that 
they are grown non-deterministically, using a two-stage random process. The algorithm is used in spatial 
prediction [40], groundwater potential assessment [41], and Gene Ontology [42]-[44]. 
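
The two sources of randomness mentioned above, and the way the trees are combined, can be sketched as follows. The fragment assumes the usual RF construction: each tree sees a bootstrap sample of the rows, each split considers a random subset of the features, and the forest predicts by majority vote. Tree growing itself is omitted, and all function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, labels, rng):
    """Stage 1: draw len(rows) samples with replacement from the original data."""
    n = len(rows)
    indices = [rng.randrange(n) for _ in range(n)]
    return [rows[i] for i in indices], [labels[i] for i in indices]

def random_feature_subset(n_features, subset_size, rng):
    """Stage 2: choose the feature indices a split is allowed to consider."""
    return rng.sample(range(n_features), subset_size)

def majority_vote(predictions):
    """Combine the per-tree class predictions into the forest's prediction."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample_rows, sample_labels = bootstrap_sample([[1], [2], [3], [4]], [0, 0, 1, 1], rng)
candidate_features = random_feature_subset(10, 3, rng)
```

Because every tree is trained on a different sample and different feature subsets, the individual errors tend to average out in the vote.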


3. PROPOSED RESEARCH MODEL 

The breast cancer database is a publicly available dataset from the University of California Irvine 
Machine Learning Repository. It gives information on tumor features such as size and density. The study 
is focused primarily on comparing the accuracy with which tumors are classified as malignant or benign. 
Three classification algorithms are applied in this comparative study: logistic regression, RF and decision 
tree. Ten real-valued features are considered for each cell nucleus. For the class attribute, the values 0 and 
1 signify benign and malignant tumours, respectively. Figure 1 explains the classification model used 
during the comparative study. 
Figure 1. Classification model 


4. METHOD AND EXPERIMENT DETAIL 

To implement all the ML models, the Wisconsin breast cancer dataset provided by the UCI Machine 
Learning Repository is used during the study. It has 699 examples, 2 classes (malignant and benign), and 
10 attributes: concavity of the tumour (severity of concave portions of the contour), radius of the tumour 
(mean of distances from centre to points on the perimeter), texture of the tumour (standard deviation of 
grey-scale values), concave points (number of concave portions of the contour), symmetry, fractal 
dimension ("coastline approximation" - 1), smoothness of the tumour (local variation in radius lengths), 
compactness of the tumour (perimeter^2 / area - 1.0), perimeter of the tumour, and area of the tumour. 
Figure 2 shows a snapshot of the breast cancer dataset. 


[Snapshot of the first rows of the dataset. Columns shown: id, diagnosis (M or B), the mean values of 
radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and 
fractal dimension, and the standard-error (_se) values of radius, texture, perimeter, area, smoothness, 
compactness, concavity and concave points.] 

Figure 2. Breast cancer data 

Algorithm steps 

1. Imported the necessary libraries used for building the model, such as NumPy, pandas, 
matplotlib and seaborn. 
2. Loaded the dataset into the integrated development environment. 
3. Checked whether there are any missing or null values. 
4. Found the unique values in the 'diagnosis' column and replaced 'M' with 1 and 'B' with 0. 
5. Created visualizations of the diagnosis for a better understanding of the data. 
6. Removed highly correlated features and reduced the data frame to 24 columns. 
7. Imported the sklearn library for building the model and selected the features and target 
as X and Y. 
8. Applied standard scaling to the training and test data for further analysis, with the data 
divided into testing and training sets in the ratio of 20:80. 
9. For the classification models, imported all the required libraries and used logistic 
regression, random forest, Naive Bayes and decision tree for prediction. 
10. Compared the accuracy of the four models and identified the most significant models for 
accurate prediction. 
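
The label encoding, the 20:80 split and the standard scaling in the steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than the code used in the study, which relied on pandas and sklearn: the helper names are hypothetical, and fitting the scaler on the training rows only is an assumption made here to avoid leaking test statistics.

```python
import random
from statistics import mean, pstdev

def encode_diagnosis(values):
    """Replace 'M' with 1 and 'B' with 0."""
    mapping = {"M": 1, "B": 0}
    return [mapping[v] for v in values]

def train_test_split(rows, labels, test_ratio=0.2, seed=0):
    """Shuffle the sample indices and hold out `test_ratio` of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

def fit_scaler(rows):
    """Per-column mean and standard deviation (assumption: training rows only)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return means, stds

def transform(rows, means, stds):
    """Standardize every column to zero mean and unit variance."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
```

The same means and standard deviations would then be reused to transform the test rows, so that both sets are expressed on the same scale before the classifiers are fitted.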


5. RESULT ANALYSIS AND DISCUSSION 

This section provides details about the experiments and results. In the current study, ML models are 
used to classify breast cancer as malignant or benign. Table 1 shows the classification results: in addition 
to the confusion matrices, a more detailed analysis of accuracy, precision, recall, F1-score and support is 
derived and presented for comparison between the different ML models. Figures 3(a) to 3(g) show the 
confusion matrices obtained for the KNN, decision tree, RF, SVM, Naive Bayes, logistic regression and 
kernel SVM algorithms, respectively. 


Table 1. Classification result of different ML models 

ML models             Performance metrics 
                      Accuracy   Precision   Recall    F1-score   Support 
KNN                   95.68%     95.02%      97.02%    76.00%     60% 
Decision tree         97.08%     97.28%      97.45%    77.52%     62% 
RF                    95.6%      96.24%      97.01%    77.01%     61% 
SVM                   95.6%      95.24%      97.00%    76.01%     60% 
Naive Bayes           94.89%     91.28%      100%      70.00%     58% 
Logistic regression   95.6%      96.00%      96.54%    77.34%     61.3% 
Kernel SVM            95.6%      94.01%      98.02%    75.42%     58% 


Figure 3. Confusion matrix for (a) KNN, (b) decision tree, (c) RF, (d) SVM, (e) Naive Bayes, 
(f) logistic regression, and (g) Kernel SVM 


Figures 4(a) to 4(g) show the receiver operating characteristic (ROC) curves obtained for the KNN, 
decision tree, RF, SVM, Naive Bayes, logistic regression and kernel SVM algorithms, respectively. The 
investigation shows interesting results. The best classifier is the decision tree classifier: its accuracy, 
97.08%, is the highest among the algorithms. The decision tree gives a high precision of 97.28%, which 
corresponds to a low false positive rate. Recall is the proportion of correctly predicted positives among all 
observations in the actual positive class; the recall obtained is 97.45%, which is good as it is well above 
0.5. The F1-score is 77.52% for the decision tree and the support is 62%, which is good overall compared 
to the other ML models. Figure 5 shows the high accuracy gained with the decision tree classifier through 
a comparison on the accuracy parameter. 
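
For reference, the metrics discussed above follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not the counts obtained in the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were predicted positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1-score: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=45, fp=5, fn=5, tn=45)
```

Since the F1-score is the harmonic mean of precision and recall, it always lies between the two values, and support is conventionally the number of true samples of each class.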
Figure 4. ROC curve for (a) KNN, (b) decision tree, (c) RF, (d) SVM, (e) Naive Bayes, 
(f) logistic regression, and (g) Kernel SVM 

Figure 5. Accuracy comparison of different ML models 


6. CONCLUSION 

The current study focuses on ML models for the classification of a breast cancer dataset. The models 
considered for the study are KNN, decision tree classifier, SVM, RF, SVM kernels, logistic regression and 
Naive Bayes. A comparative study of all the ML models is presented for the breast cancer dataset. To 
measure performance, several metrics are used in the result analysis: accuracy, precision, recall, F1-score 
and support. The obtained results show that the decision tree classifier gives the best accuracy, 97.08%, 
with respect to the other ML models. The decision tree gains knowledge about the data during training 
itself; due to this feature, the decision tree is the best classifier for the breast cancer dataset. The accuracy 
obtained with all algorithms except the decision tree is approximately 95%. 